Abstract

In this paper we present a comprehensive framework for learning robust
low-rank representations by combining and extending recent ideas for learning
fast sparse coding regressors with structured non-convex optimization
techniques. This approach connects robust principal component analysis (RPCA)
with dictionary learning techniques and allows its approximation via trainable
encoders. We propose an efficient feed-forward architecture derived from an
optimization algorithm designed to exactly solve robust low dimensional
projections. This architecture, in combination with different training
objective functions, allows the regressors to be used as online approximants
of the exact offline RPCA problem or as RPCA-based neural networks. Simple
modifications of these encoders can handle challenging extensions, such as the
inclusion of geometric data transformations. We present several examples with
real data from image, audio, and video processing. When used to approximate
RPCA, our basic implementation shows several orders of magnitude speedup
compared to the exact solvers with almost no performance degradation. We show
the strength of the inclusion of learning to the RPCA approach on a music
source separation application, where the encoders outperform the exact RPCA
algorithms, which are already reported to produce state-of-the-art results on
a benchmark database. Our preliminary implementation on an iPad shows
faster-than-real-time performance with minimal latency.



* Authors' email addresses: pablo.sprechmann@duke.edu, bron@eng.tau.ac.il, and
guillermo.sapiro@duke.edu.

1 Introduction

Principal component analysis (PCA) is the most widely used statistical
technique for dimensionality reduction, with applications ranging from machine
learning and computer vision to signal processing and bioinformatics, just to
mention a few. Given a data matrix $X \in \mathbb{R}^{m \times n}$ (each column
of $X$ is an $m$-dimensional data vector), it is decomposed as $X = L + N$,
where $L$ is a low rank matrix and $N$ is a perturbation matrix. PCA is known
to produce very good results when the perturbation is small Jolliffe (2002).
However, its performance is highly sensitive to the presence of samples not
following the model; even a single outlier in the data matrix $X$ can render
the estimation of the low rank component arbitrarily far from the true matrix
$L$. This motivated a significant amount of work dedicated to robustifying
PCA; see Torre & Black (2003); Candes et al. (2011) for recent work and
references therein for previous results. In a series of recent works Candes et
al. (2011); Xu et al. (2010), a very elegant solution to this problem was
developed, in which the low rank matrix is determined as the minimizer of a
convex program. The basic idea is to add a new term in the decomposition to
account for the presence of outliers, $X = L + N + O$, where $O$ is an error
matrix with a small number of non-zero coefficients of arbitrarily large
magnitude. Robust PCA is then obtained by solving

$$\min_{L,O \in \mathbb{R}^{m \times n}} \|L\|_* + \lambda \|O\|_1 \quad \text{s.t.} \quad \|X - L - O\|_F \le \epsilon, \qquad (1)$$

where $\|L\|_*$ denotes the matrix nuclear norm, defined as the sum of the
singular values of $L$ (the convex surrogate of the rank), $\lambda$ is a
positive scalar parameter controlling the sparsity level in the outliers, and
$\epsilon$ is a parameter controlling the error of the approximation. In the
noiseless setting, the constraint is often substituted by the equality
$X = L + O$ Candes et al. (2011).



This particular formulation of robust PCA has attracted significant interest
in the machine learning, computer vision, and signal processing communities,
and was successfully used in applications such as face recognition and
modeling Wagner et al. (2011); Peng et al. (2010), background modeling Zhou et
al. (2010); Qiu & Vaswani (2011), large scale image tag transduction Mu et al.
(2011), and audio source separation Huang et al. (2012). A challenge often
encountered in modern applications is that the flow of new input data is
continuous. The robust low rank model then needs to be adapted constantly,
since the principal directions can change over time, calling for the
development of efficient online techniques Qiu & Vaswani (2011); Wai-tian et
al. (2011); Mateos & Giannakis (2011); Balzano et al. (2010).



A significant amount of effort has been devoted to developing optimization
algorithms for efficiently solving (2) and its noiseless formulation.
First-order techniques based on proximal methods Candes et al. (2011); Cai et
al. (2010); Lin et al. (2009), and augmented Lagrangian approaches Ma et al.
(2011), have been shown to be fast and effective when the data size is
moderate. More recent efforts proposed methods using random projections with
drastically reduced scale Mu et al. (2011), or decomposing (2) into a
non-convex structured problem of significantly reduced size Recht & Re (2011);
Mateos & Giannakis (2011). Despite the steady progress reported in the
literature, state-of-the-art algorithms for solving this problem still have
prohibitive complexity and latency for real-time processing.
In the sparse coding domain, a very similar situation was encountered a few
years ago. Techniques based on sparse representations in learned over-complete
dictionaries produced outstanding results in various computer vision and
signal processing problems, but they are often prohibitively costly for
real-time computation. This motivated significant effort in the deep learning
community aiming at overcoming this problem. On one hand, several works
concentrated on proposing systems capable of producing sparse codes, aiming to
bring the success of the exact sparse coding algorithms to extremely efficient
deep learning schemes, e.g., Ranzato et al. (2007); Goodfellow et al. (2009).
In a different approach, several works proposed learning non-linear regressors
capable of producing good approximations of the true sparse codes in a fixed
amount of time Jarrett et al. (2009); Kavukcuoglu et al. (2010). The
insightful work in Gregor & LeCun (2010) introduced an approach in which the
regressors are multilayer artificial neural networks with an architecture
inspired by first order optimization algorithms for solving sparse coding
problems. These regressors are trained to minimize the mean squared error
between the predicted and exact codes over a given training set, and produced
high quality approximations of sparse codes for vectors following the same
distribution as the training sample.

Motivated by the latter approach, in this paper we propose to extend these
ideas to the RPCA context. We propose to design regressors capable of
approximating online RPCA in a very fast way. To the best of our knowledge,
encoders of this type have not been developed before. We follow Gregor & LeCun
(2010) and base the architecture of the encoders on the iterations of exact
RPCA algorithms. However, unlike the standard sparse coding setting, the exact
first order RPCA algorithms cannot be used directly, as each iteration
involves an SVD. As a remedy, we use an algorithm inspired by the non-convex
optimization techniques proposed in Recht & Re (2011). Our RPCA encoders are
learned by minimizing various carefully chosen objective functions that allow
their use in several different contexts, as explained in the sequel.

Learning encoders to approximate RPCA becomes particularly relevant when the
low rank model has to be re-computed or updated constantly over time. We
propose a training objective function that allows the encoders to be trained
in an online manner on the very same data vectors fed to them. As a result,
the fast encoders are no longer restricted to a specific distribution of input
vectors known a priori (a limitation existing, for example, in Gregor & LeCun
(2010)), and there is no need to run the exact algorithms beforehand. This
approach is related to the sparse autoencoders Goodfellow et al. (2009), as
will be further discussed in Section 3.1.

Several applications of RPCA rely on the critical assumption that the given
input vectors are aligned with respect to a group of geometric transformations
Peng et al. (2010). The state-of-the-art techniques for addressing this
problem involve solving several RPCA problems while changing the individual
transformations applied to each input data vector. The differentiability of
the proposed encoders with respect to the input, output and training data
allows a very simple incorporation of geometric transformations. We propose a
learning setting for that important case as well.
Finally, in many applications RPCA is applied to signals not exactly following
the low rank model with sparse additive outliers. A clear example is the
problem of separating the leading singing voice from the musical background in
a monaural recording, as detailed in Section 4. In Huang et al. (2012), the
authors obtained state-of-the-art results in this problem using RPCA in the
time-frequency domain by modeling the repetitive structure of the
accompaniment as a low-rank linear model and the singing voice as sparse
outliers. It is clear, however, that using a richer model for representing the
singing voice (e.g., the harmonic structure makes the voice patterns highly
structured) would produce a better performance. We propose to fill in the gap
between the RPCA model and the real signals by incorporating learning, that
is, by changing the training objective function of our RPCA encoders so that
they approximate the desired source separation. Experimental evaluation shows
the benefit of this approach, which serves as an illustration of the use of
our proposed framework for different objective functions and tasks other than
reconstruction.

The rest of the paper is organized as follows: In Section 2 we present our
approach to robust PCA and discuss exact optimization algorithms to solve it.
In Section 3 we introduce the new robust encoders and the new objective
functions used for their training. We also discuss the online setting and the
possibility to incorporate geometric transformations. In Section 4 we present
several experimental results. Conclusions are drawn in Section 5.



2 Online RPCA via non-convex factorization 

In this paper we tackle the RPCA problem by solving the unconstrained
optimization problem

$$\min_{L,O} \frac{1}{2}\|X - L - O\|_F^2 + \lambda_* \|L\|_* + \lambda \|O\|_1. \qquad (2)$$

This formulation is equivalent to (1) in the sense that for every
$\epsilon > 0$ one can find a $\lambda_* > 0$ such that (1) and (2) admit the
same solutions.

Just as the $\ell_1$-norm encourages sparsity in vectors, the nuclear norm
promotes low rank in matrices. Recht et al. (2010) showed that the nuclear
norm of a matrix with $\mathrm{rank}(L) \le q$ can be reformulated as a
penalty over all possible factorizations,

$$\|L\|_* = \min_{U,S} \frac{1}{2}\|U\|_F^2 + \frac{1}{2}\|S\|_F^2 \quad \text{s.t.} \quad US = L, \qquad (3)$$

with $U \in \mathbb{R}^{m \times q}$ and $S \in \mathbb{R}^{q \times n}$. The
minimum is achieved by the SVD: if $L = U_L \Sigma V_L^T$, then the minimum of
(3) is attained at $U = U_L \Sigma^{1/2}$ and $S = \Sigma^{1/2} V_L^T$. This
factorization has recently been exploited in parallel processing across
multiple processors to produce state-of-the-art algorithms for matrix
completion problems Recht & Re (2011), as well as in an alternative approach
to robustifying PCA in Mateos & Giannakis (2011).
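
The identity in (3) is easy to check numerically. The following NumPy sketch
is only an illustration: the function names, matrix sizes and random data are
assumptions made here and do not come from the paper. It builds a matrix of
rank at most q, forms the SVD-based factors, and evaluates the factorization
penalty next to the nuclear norm.

```python
import numpy as np


def nuclear_norm(L):
    """Sum of the singular values of L."""
    return np.linalg.svd(L, compute_uv=False).sum()


def svd_factors(L, q):
    """Factors U = U_L Sigma^{1/2} and S = Sigma^{1/2} V_L^T from the top q singular triplets."""
    U_L, sigma, Vt = np.linalg.svd(L, full_matrices=False)
    root = np.sqrt(sigma[:q])
    return U_L[:, :q] * root, root[:, None] * Vt[:q, :]


def factor_penalty(U, S):
    """The penalty (||U||_F^2 + ||S||_F^2) / 2 appearing in (3)."""
    return 0.5 * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(S, "fro") ** 2)


# Illustrative data (sizes chosen arbitrarily): a random matrix of rank <= q.
rng = np.random.default_rng(0)
m, n, q = 30, 40, 5
L = rng.standard_normal((m, q)) @ rng.standard_normal((q, n))
U, S = svd_factors(L, q)
print(nuclear_norm(L), factor_penalty(U, S))
```

Any other factorization of the same matrix, for instance $UG$ and $G^{-1}S$
with an invertible $G$, can only give a penalty at least as large.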
In (2), neither the rank of $L$ nor the level of sparsity in $O$ is assumed
known a priori. However, in many applications it is reasonable to have a rough
upper bound on the rank, say $\mathrm{rank}(L) \le q$. Combining this with
(3), we can reformulate (2) as

$$\min_{U,S,O} \frac{1}{2}\|X - US - O\|_F^2 + \frac{\lambda_*}{2}\left(\|U\|_F^2 + \|S\|_F^2\right) + \lambda \|O\|_1, \qquad (4)$$

with $U \in \mathbb{R}^{m \times q}$, $S \in \mathbb{R}^{q \times n}$, and
$O \in \mathbb{R}^{m \times n}$. This decomposition reveals a lot of structure
hidden in the problem. The low rank component can now be thought of as an
under-complete dictionary $U$, with $q$ atoms, multiplied by a matrix $S$
containing in its columns the corresponding coefficients for each data vector
in $X$. This interpretation brings our problem close to that of dictionary
learning in the sparse modeling domain Mairal et al. (2009).



While this new factorized formulation drastically reduces the number of
optimization variables from $2nm$ to $q(n + m)$, problem (4) is no longer
convex. Fortunately, it can be shown that any stationary point of (4),
$\{U, S, O\}$, satisfying $\|X - US - O\|_2 \le \lambda_*$ is an optimal
solution of (4) Mardani et al. (2011). Thus, problem (4) can be solved using
an alternating minimization or block coordinate scheme, in which the cost
function is minimized with respect to each individual optimization variable
while keeping the other ones fixed, without the risk of falling into a
stationary point that may not be globally optimal. This will be exploited to
design our fast encoders.
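
A block coordinate scheme of this kind can be sketched as follows. This is a
minimal illustration under stated assumptions rather than the authors'
implementation: the update order, the random initialization, the fixed
iteration count and all function names are choices made here. Each block
update has a closed form (two ridge regressions and one soft-thresholding), so
the cost of (4) cannot increase from one sweep to the next.

```python
import numpy as np


def soft_threshold(V, lam):
    """Entry-wise soft-thresholding, the proximal operator of lam * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - lam, 0.0)


def cost(X, U, S, O, lam_star, lam):
    """Objective of the factorized problem (4)."""
    fit = 0.5 * np.linalg.norm(X - U @ S - O, "fro") ** 2
    reg = 0.5 * lam_star * (np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(S, "fro") ** 2)
    return fit + reg + lam * np.abs(O).sum()


def factored_rpca(X, q, lam_star, lam, n_iter=50, seed=0):
    """Block coordinate descent on (4); returns the factors and the cost after each sweep."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.standard_normal((m, q))      # random start (an assumption of this sketch)
    O = np.zeros((m, n))
    I = np.eye(q)
    history = []
    for _ in range(n_iter):
        R = X - O
        S = np.linalg.solve(U.T @ U + lam_star * I, U.T @ R)      # exact minimizer in S
        U = np.linalg.solve(S @ S.T + lam_star * I, S @ R.T).T    # exact minimizer in U
        O = soft_threshold(X - U @ S, lam)                        # exact minimizer in O
        history.append(cost(X, U, S, O, lam_star, lam))
    return U, S, O, history
```

After the iterations stop, the condition quoted above can be inspected with
`np.linalg.norm(X - U @ S - O, 2) <= lam_star`, which evaluates the spectral
norm of the residual.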

2.1 Robust low dimensional projections 

Let us assume that we have already learned a low dimensional model,
$U \in \mathbb{R}^{m \times q}$, from some data
$X \approx US + O \in \mathbb{R}^{m \times n}$. Suppose that we are given a
new input vector $x \in \mathbb{R}^m$ drawn from the same distribution as $X$.
Then $x$ can be decomposed as $x = Us + n + o$, where $Us$ represents the low
dimensional component, $n$ is a small perturbation and $o$ is a sparse outlier
vector. We propose to do it by extending (4),

$$\min_{s \in \mathbb{R}^q,\, o \in \mathbb{R}^m} \frac{1}{2}\|x - Us - o\|_2^2 + \frac{\lambda_*}{2}\|s\|_2^2 + \lambda \|o\|_1. \qquad (5)$$



Unlike dictionary learning problems Mairal et al. (2009), here the columns of
the dictionary $U$ are not constrained to have unit norm. In fact, the
differences in the norms of the different atoms play a crucial role in the
estimation, weighting the relevance of the atoms in the low dimensional
distribution and appearing in the objective function through the quadratic
term $\|s\|_2^2$. To give some further intuition, we analyze the program (5)
and its relation with the possible solutions of (4). As discussed in the
previous section, if the upper bound $q$ for the rank of the true low
dimensional model is correct, any pair of matrices $\{U, S\}$ found as a
stationary point of (4) and satisfying $\|X - US - O\|_2 \le \lambda_*$ is
guaranteed to satisfy $L = US$. For simplicity in the notation and without
loss of generality, in the sequel we assume that the rank of $L$ is exactly
$q$. Then, the solution of program (5), $\{s, o\}$, satisfies
$Us = \tilde{U}\tilde{s}$ and $o = \tilde{o}$, where
$\{\tilde{s}, \tilde{o}\}$ is the solution obtained by substituting $U$ by
$\tilde{U} = U_L \Sigma^{1/2}$ in (5). Applying the change of coordinates
$\tilde{s} = \Sigma^{-1/2} w$, this new problem can be written as

$$\min_{w,\, o} \frac{1}{2}\|x - U_L w - o\|_2^2 + \frac{\lambda_*}{2}\sum_{i=1}^{q} \frac{w_i^2}{\sigma_i} + \lambda \|o\|_1, \qquad (6)$$

where the $w_i$'s are the individual coefficients of $w$ and the $\sigma_i$'s
are the singular values of $L$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_q)$.
Note that $\tilde{U}\tilde{s} = U_L w$. The second term in the cost in (6)
acts as a regularizer that encourages the use of the coefficients of $w$
corresponding to the dominant directions (larger singular values) of $L$.

The robust low dimensional projection (5) is a convex program that can be
solved using several methods. We are interested in choosing an optimization
algorithm that can be further used to define the architecture of trainable
encoders for simultaneously estimating $s$ and $o$. With this in mind, we
choose to use the alternating minimization scheme described in Algorithm 1.
The solution of (5) is given by $s = (U^T U + \lambda_* I)^{-1} U^T (x - o)$
and $o = \pi_{\boldsymbol{\lambda}}(x - Us)$ when fixing $o$ and $s$,
respectively. Here $\pi_{\boldsymbol{\lambda}}$ is the scalar
soft-thresholding operator with parameter
$\boldsymbol{\lambda} \in \mathbb{R}^m$, which applies a soft-threshold
$\lambda_i$ to the $i$-th component of the input vector. In this case,
$\boldsymbol{\lambda} = \lambda \mathbf{1}$.

  input : Data $x$, dictionary $U$, parameters $\lambda_*$ and $\lambda$.
  output: Coefficient vector $s$ and outlier vector $o$.
  Define $H = (U^T U + \lambda_* I)^{-1}$, $W = U H U^T$, and $\boldsymbol{\lambda} = \lambda \mathbf{1}$.
  Initialize $y = 0$ and $b = (I - W)x$.
  repeat
      $o = \pi_{\boldsymbol{\lambda}}(b)$
      $b = b + W(o - y)$
      $y = o$
  until convergence;
  Output $o$ and $s = H U^T (x - o)$.

Algorithm 1: Alternating minimization scheme for solving (5).
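
The alternating scheme for (5) translates almost line by line into NumPy. The
sketch below is illustrative only: it uses $s = (U^T U + \lambda_* I)^{-1} U^T (x - o)$
for the coefficient update, and the stopping rule (a small change in $o$
between iterations), the iteration cap and the function names are assumptions
made here, since the text only says "until convergence".

```python
import numpy as np


def soft_threshold(v, lam):
    """Component-wise soft-thresholding with threshold lam (scalar or vector)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def robust_projection(x, U, lam_star, lam, max_iter=1000, tol=1e-12):
    """Alternating minimization for the robust projection (5)."""
    q = U.shape[1]
    H = np.linalg.inv(U.T @ U + lam_star * np.eye(q))
    W = U @ H @ U.T
    y = np.zeros_like(x)              # previous outlier estimate
    b = x - W @ x                     # b = (I - W) x, i.e. x - U s for o = 0
    for _ in range(max_iter):
        o = soft_threshold(b, lam)    # minimize over o with s fixed
        if np.linalg.norm(o - y) <= tol:
            break                     # assumed stopping rule
        b = b + W @ (o - y)           # refresh x - U s for the new o
        y = o
    s = H @ U.T @ (x - o)             # minimize over s with o fixed
    return s, o
```

At convergence the pair returned by `robust_projection` satisfies both
closed-form conditions at once, which gives a simple way to validate an
implementation.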



2.2 Online RPCA 

In Section 2 we assumed that the entire data matrix $X$ was available a
priori. We now address the case when the data samples
$\{x_t\}_{t \in \mathbb{N}}$, $x_t \in \mathbb{R}^m$, arrive sequentially; the
index $t$ should be understood as a temporal variable. Online RPCA aims at
estimating and refining the model as the data comes in. The need for online
algorithms arises naturally in various applications, e.g., when large volumes
of data are continuously generated over time. Other applications aim at
estimating models for dynamic data constantly changing over time. Finally,
online learning has also been extensively used when the available training
data are simply too large to be handled together Mairal et al. (2009).
We propose to address online RPCA by extending the approach presented in
Section 2. An alternating minimization algorithm for solving the online
counterpart of (4) goes as follows: when a new data vector $x_t$ is received,
we first obtain its representation $\{s_t, o_t\}$ given the current model
estimate, $U_{t-1}$,

$$\{s_t, o_t\} = \arg\min_{s,\, o} \frac{1}{2}\|x_t - U_{t-1} s - o\|_2^2 + \frac{\lambda_*}{2}\|s\|_2^2 + \lambda \|o\|_1. \qquad (7)$$

Then, we update the model using the projections, $\{s_j\}_{j \le t}$ and
$\{o_j\}_{j \le t}$, computed during the previous steps of the algorithm,

$$U_t = \arg\min_{U} \sum_{j=1}^{t} \beta_j \left(\frac{1}{2}\|x_j - U s_j - o_j\|_2^2\right) + \frac{\lambda_*}{2}\|U\|_F^2, \qquad (8)$$

where $\beta_j \in [0, 1]$ is a forgetting factor that can be added to rescale
older information so that newer estimates $\{s_j, o_j\}$ have more weight.

Problem (7) is identical to (5) and can be solved using Algorithm 1, setting
$U = U_{t-1}$ and $x = x_t$. There are two major approaches to solving (8).
The first one is to solve it recursively from previous estimates, using
strategies such as block-coordinate descent methods with warm restarts Mairal
et al. (2009) or recursive least squares Mateos & Giannakis (2011). The other
option is to directly solve the system of equations

$$U_t \left(\sum_{j=1}^{t} \beta_j s_j s_j^T + \lambda_* I\right) = \sum_{j=1}^{t} \beta_j (x_j - o_j) s_j^T. \qquad (9)$$

While the recursive strategies have, in general, lower computational
complexity, in particular for large scale problems, they require more storage.
The choice of the dictionary update strategy is, therefore,
application-dependent.
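
As an illustration of the direct option, the sketch below solves the
regularized least-squares dictionary update in one shot from stored codes. It
assumes the quadratic regularizer $\frac{\lambda_*}{2}\|U\|_F^2$ sits outside
the weighted sum, and the function name and argument layout (one sample per
column, one weight per sample) are hypothetical choices made for this example.

```python
import numpy as np


def update_dictionary(X, S, O, beta, lam_star):
    """Solve U (sum_j beta_j s_j s_j^T + lam_star I) = sum_j beta_j (x_j - o_j) s_j^T.

    X, O: m x t arrays of samples and outliers; S: q x t array of coefficients;
    beta: length-t array of forgetting weights.
    """
    q = S.shape[0]
    A = (S * beta) @ S.T + lam_star * np.eye(q)   # q x q, symmetric positive definite
    B = ((X - O) * beta) @ S.T                    # m x q
    # U A = B is equivalent to A U^T = B^T because A is symmetric.
    return np.linalg.solve(A, B.T).T
```

Only the two accumulated matrices `A` and `B` are needed, so in a streaming
setting they could be updated sample by sample instead of being rebuilt from
the full history.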




3 Online RPCA via fast trainable encoders 



As mentioned in the Introduction, one of the main contributions of the present
paper is the construction of trainable regressors capable of approximating the
solution of (5) for a given fixed dictionary $U$ (the latter will be updated
as well, as shown in the sequel). The main idea is to build a parametric
regressor $z = (s, o) = h(x, \Theta)$, with some set of parameters
collectively denoted as $\Theta$. Thus, we need to define an architecture for
$h$ and a learning algorithm in order to determine $\Theta$.

Following the fast sparse coding methods in Gregor & LeCun (2010); Sprechmann
et al. (2012), we propose to use a feed-forward multi-layer architecture where
each layer implements a single iteration of an exact solver of the problem. In
this case we use the tailored alternating minimization scheme described in
Algorithm 1. The parameters of the network are the matrices $W$ and $H$ and
the thresholds $\boldsymbol{\lambda}$ (extra flexibility is obtained by
learning different thresholds for each component). The encoder architecture is
depicted in Figure 2 in the supplementary material. Each layer essentially
consists of the nonlinear thresholding operator $\pi_{\boldsymbol{\lambda}}$
followed by a linear operation $W$. The network parameters are initialized as
in Algorithm 1.
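
The forward pass of such an encoder can be sketched as a fixed number of
unrolled iterations whose matrices and thresholds are stored as free
parameters. The code below is an assumption-laden illustration, not the
authors' network: the parameter dictionary, the function names, the batch
layout (one sample per column), and the choice to share one parameter set
across all layers are made here for brevity. Training (back-propagation) is
not shown.

```python
import numpy as np


def soft_threshold(V, lam):
    """Soft-thresholding; lam may hold one threshold per component (row)."""
    return np.sign(V) * np.maximum(np.abs(V) - lam, 0.0)


def init_encoder(U, lam_star, lam):
    """Initial parameters taken from the exact alternating scheme for (5)."""
    m, q = U.shape
    H = np.linalg.inv(U.T @ U + lam_star * np.eye(q))
    return {"W": U @ H @ U.T, "H": H, "lam": np.full((m, 1), lam)}


def encode(X, U, params, n_layers):
    """Propagate the columns of X through n_layers layers; returns (S, O)."""
    W, H, lam = params["W"], params["H"], params["lam"]
    O = np.zeros_like(X)
    Y = O
    B = X - W @ X
    for _ in range(n_layers):
        O = soft_threshold(B, lam)    # nonlinear thresholding
        B = B + W @ (O - Y)           # linear operation
        Y = O
    S = H @ U.T @ (X - O)             # output coefficients
    return S, O
```

With untrained parameters this is simply a truncated run of the exact
algorithm; learning then adjusts `W`, `H` and `lam` so that a few layers give
a better code than the same number of exact iterations.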

As a learning strategy, we propose to select the set of parameters $\Theta$
that minimizes the loss function

$$\mathcal{L}(\Theta) = \frac{1}{n}\sum_{j=1}^{n} L(\Theta, x_j) \qquad (10)$$

on a training set $X = \{x_1, \dots, x_n\}$. Here, $L(\Theta, x_j)$ is a
function that measures the goodness of the code $z_j = h(x_j, \Theta)$
produced for the data point $x_j$. We discuss below several different options
for choosing $L$. The selection of the objective function $L$ sets the type of
regressor that we are going to obtain, and it is clearly application
dependent.

One of the most straightforward choices is to use
$L(\Theta, x_j) = \|z_j - z_j^*\|_2^2$, with $z_j^* = (s_j^*, o_j^*)$ being
the $j$-th columns of the decomposition of the data $X = (x_1, \dots, x_n)$
into $X = US^* + O^*$ by the exact RPCA. This essentially trains the encoder
to approximate the exact solution of the RPCA problem. In other applications,
the data may not completely adhere to the assumptions of the RPCA model, and
the exact solution is, therefore, not necessarily the best one. This is the
case in the source separation problem discussed in the Introduction, where
RPCA gives a very good separation of the spectrally sparse singing voice and
repetitive low-rank background accompaniment, yet the obtained signals are
still not equal to the true voice and background tracks. In this case, one
could use a collection of clean voice and background tracks, $\{s_j^*\}$ and
$\{o_j^*\}$ respectively, to supervise the training, often achieving better
separation results. Other choices of the loss function are discussed in the
sequel.

We perform the minimization of a loss function $\mathcal{L}(\Theta)$ with
respect to $\Theta$ using stochastic gradient descent, as in Gregor & LeCun
(2010). Specifically, we iteratively select a random subset of $X$ and then
update the network parameters as
$\Theta \leftarrow \Theta - \mu \frac{\partial \mathcal{L}(\Theta)}{\partial \Theta}$,
where $\mu$ is a decaying step, repeating the process until convergence. This
requires the computation of the (sub)gradients $dL(\Theta, x_j)/d\Theta$,
which is achieved by a back-propagation procedure.



3.1 Online learning 

The robust projection (5) can be viewed as a mapping between a data vector $x$
and the corresponding pair $z = \{s, o\}$ minimizing the cost function

$$f(x, z) = \frac{1}{2}\|x - Us - o\|_2^2 + \frac{\lambda_*}{2}\|s\|_2^2 + \lambda \|o\|_1. \qquad (11)$$

This objective is trusted as an indication of the decomposition quality, as
explained in Section 2.1. Then, the network can be trained to minimize the
ensemble average of $f$ on a training set with $z = \arg\min f(x, z)$ replaced
by $z = h(x, \Theta)$. This results in the training objective (10) by
selecting $L_{\mathrm{on}}(\Theta, x_j) = f(x_j, h(x_j, \Theta))$ and adding a
forgetting factor $\beta_j$ as described in Section 2.2.

When the training of the regressors can be done online, one can further
consider the online adaptation of the dictionary. This can be done simply by
treating $U$ as another optimization variable in the training, and minimizing
$\mathcal{L}$ with respect to both $U$ and the network parameters, alternating
between network training and dictionary update iterations. In this setting,
the model adaptation is equivalent to (8). This essentially extends the
proposed framework into a full-featured online RPCA encoder, trained on the
very same data fed to it for robust low dimensional projections. In this
setting, our regressors can be interpreted as an online trainable sparse
auto-encoder Goodfellow et al. (2009) with a multi-layer non-linear encoder
and simple linear decoder. The higher complexity of the proposed architecture
in the encoder allows the system to produce accurate estimates of true
structured sparse codes.
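
The unsupervised objective of this section only needs the cost (11) evaluated
at the codes produced by the encoder. The sketch below shows that evaluation
together with a weighted ensemble average. Several details are assumptions
introduced for the example: the exponential weights $\beta^{t-j}$, the
normalization by the sum of the weights, and the names `rpca_cost` and
`weighted_loss`; the text itself only states that a forgetting factor
$\beta_j$ is applied.

```python
import numpy as np


def rpca_cost(x, U, s, o, lam_star, lam):
    """Cost f(x, z) of (11) for a single sample x and code z = (s, o)."""
    r = x - U @ s - o
    return 0.5 * (r @ r) + 0.5 * lam_star * (s @ s) + lam * np.abs(o).sum()


def weighted_loss(X, U, codes, lam_star, lam, beta=1.0):
    """Weighted average of f over the columns of X.

    codes is a list of (s, o) pairs, one per column of X, oldest first.
    The newest sample gets weight 1 and older ones beta, beta^2, ...
    (assumed exponential forgetting).
    """
    n = X.shape[1]
    weights = beta ** np.arange(n - 1, -1, -1)
    costs = np.array([rpca_cost(X[:, j], U, s, o, lam_star, lam)
                      for j, (s, o) in enumerate(codes)])
    return float(weights @ costs / weights.sum())
```

In training, the pairs in `codes` would be the encoder outputs for the current
parameters, so that the quantity returned here is the function being
differentiated.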

3.2 Geometric transformations 

The underlying model used in RPCA relies on the critical assumption that the
given input vectors $X$ are "aligned" with respect to each other. While this
assumption holds for many applications (e.g., background subtraction with
static cameras), it does not apply in all cases. The canonical example is face
modeling, where the low dimensional model only holds if the facial images are
pixel-wise aligned Peng et al. (2010). Even small misalignments can break the
structure in the data; the representation quickly degrades as the rank of the
low dimensional component $US$ increases and the matrix of outliers $O$ loses
its sparsity.

This challenging problem has been recently studied in the literature. In
Kemelmacher-Shlizerman & Seitz (2011) the authors propose a pre-processing
strategy to align the training images. In Peng et al. (2010), the authors
simultaneously align the input images and solve RPCA with a sequence of convex
problems.



Following Peng et al. (2010), we propose to incorporate the optimization over
geometric transformations of the input data into the representation problem.
Then, the optimization problem (4) becomes

$$\min_{U,S,O,\alpha} \frac{1}{2}\|T_\alpha(X) - US - O\|_F^2 + \frac{\lambda_*}{2}\left(\|U\|_F^2 + \|S\|_F^2\right) + \lambda \|O\|_1, \qquad (12)$$

where $T_\alpha$ is a parametrized operator (with a set of parameters
collectively denoted as $\alpha = [\alpha_1, \dots, \alpha_n]$) that applies a
different geometric transformation, $T_{\alpha_i}$, to each training vector
$x_i$. This formulation is highly non-convex and difficult to optimize.
Interestingly, the framework of trainable regressors introduced in Section 3.1
is very well suited for producing accurate approximations of (12) at a very
mild extra computational expense. We propose to use the training objective
function defined in (10) with
$L_{\mathrm{Tr}}(\Theta, x_j, \alpha_j) = f(T_{\alpha_j}(x_j), h(T_{\alpha_j}(x_j), \Theta))$.

The obtained regressors are conceptually very similar to the ones we had
before and can still be trained in an online manner. When a new data vector
$x_t$ arrives,
Table 1: Robust PCA representation accuracy (in the sense of the
$\ell_2 + \ell_1$ cost) of the faces data using different encoders. The cost
for the exact encoder is 1.290.

Encoder        Untrained   Supervised   Unsupervised   Unsupervised + U update
Single layer   1.3355      1.3471       1.3460         1.3262
2 layers       1.3248      1.3261       1.3255         1.3171
10 layers      1.2968      1.2977       1.2969         1.2885



we compute its robust low rank projection by minimizing
$L_{\mathrm{Tr}}(\Theta, x_t, \alpha_t)$ over the vector $\alpha_t$
parameterizing the geometric transformation. Here, $h(x_t, \Theta)$ is almost
everywhere differentiable with respect to its input $x_t$, which allows
finding the (sub)gradient with respect to $\alpha_t$ by applying the chain
rule. In the same way, as new data arrives, the transformation of all the
previously seen training vectors is updated through the minimization of a loss
function $\mathcal{L}(\Theta, \alpha)$ with respect to $\alpha$, following the
same ideas as in Section 3.1. This strategy can also be used in the standard
RPCA scenario; however, the representations $z = \{o, s\}$ themselves are
minimizers of a convex problem, making the minimization with respect to
$\alpha$ cumbersome and computationally expensive.



4 Experimental results 

In what follows, we evaluate the proposed RPCA encoders on image, video, and
audio data. Due to lack of space, only the essential details of the experiments
are given; the reader is referred to the cited references for further details
of the experimental settings that were reproduced here.

Coding performance: The quality of different robust PCA encoders was evaluated
on a dataset consisting of 800 images of size 66 × 48 of a female face
photographed over the timespan of 4.5 years, roughly pose- and
scale-normalized and aligned.¹

¹ The original video can be found at http://www.youtube.com/watch?v=02e5EWUP5TE.
[Figure 2 image not reproduced. Panel groups, left to right: Original,
Misaligned, Optimally aligned. Cost values printed in the figure, in
extraction order:
0.8931 1.8981 1.4502 1.1015
2.8364 3.6438 3.6123 2.9722
0.8209 1.8518 1.3844 1.0884]

Figure 2: Robust PCA representation of the faces dataset in the presence of
geometric transformations (misalignment). Left group: original faces; middle
group: shifted faces; right group: faces optimally realigned during encoding.
First row: reconstructed face $Us + o$; middle row: its low-rank approximation
$Us$; and bottom row: sparse outlier $o$. The $\ell_2 + \ell_1$ cost is given
for each representation.

Neural networks with different numbers of layers were trained on 500 vectors
from the faces dataset. The following training objectives were used: the
$\ell_2$ discrepancy between the exact representation $s^*$ and $o^*$
(referred to as Supervised); the $\ell_2 + \ell_1$ objective (11); and the
$\ell_2 + \ell_1$ objective combined with the online update of the dictionary
$U$ (the initial dictionary was computed using standard SVD). Parameters were
set to $\lambda_* = 0.1$ and $\lambda = 10^{-?}$. For reference, we also
report the results produced by the exact Algorithm 1 and an untrained network
with $W$, $H$ and $\boldsymbol{\lambda}$ initialized according to Algorithm 1
(being effectively a truncated version of the algorithm). The obtained
representations are visualized in Figure 3 in the supplementary material.
Table 1 summarizes the quality of the representations in terms of the
$\ell_2 + \ell_1$ cost (lower numbers correspond to better quality). Note how
sufficiently deep encoders with dictionary update slightly outperform the
exact encoder without dictionary adaptation. Also note that using a neural
network encoder to approximate the exact representations slightly degrades the
$\ell_2 + \ell_1$ measure compared to the untrained encoder.
Online learning: In this experiment we evaluate the online learning
capabilities of the proposed neural network encoders. As the input data we
used the time-ordered sequence of 800 images from the faces dataset. Online
learning was performed in overlapping windows of 100 images with a step of 10
images. We compared the exact algorithm, a five layer neural network encoder
trained offline using the $\ell_2 + \ell_1$ objective (NN offline), and the
same encoder trained online with adaptive $U$. The dictionary was initialized
using SVD. Performance measured in terms of the exact cost (2) is reported in
Figure 1. The offline encoder is consistently slightly inferior to the exact
algorithm. However, thanks to its capability to adapt to the changing data
distribution, the online encoder starts outperforming the offline counterparts
after a relatively brief period of initial adaptation.

Figure 3: Robust PCA representation of several frames from the surveillance
sequence obtained using the algorithm in Lin et al. (2009) (left group),
Algorithm 1 (middle group), and a five layer neural network encoder (right
group). Columns in each group are, left-to-right: the reconstructed frame
$Us + o$, its low-rank approximation $Us$ (background), and the sparse outlier
$o$ (foreground). Each row corresponds to a different frame.
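
The windowing used in the online learning experiment (windows of 100 images
advanced by 10 images over a sequence of 800) can be made concrete with a few
lines of plain Python. How the boundaries were handled is not stated in the
text; this sketch assumes that only windows fitting entirely inside the
sequence are used, and the function name is hypothetical.

```python
def sliding_windows(n_samples, window, step):
    """Return (start, stop) index pairs of overlapping windows inside the sequence."""
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, step)]


# Settings quoted in the experiment: 800 images, windows of 100, step of 10.
windows = sliding_windows(800, 100, 10)
print(len(windows), windows[0], windows[-1])
```

Under that assumption, consecutive windows share 90 images, so each online
update sees mostly familiar data plus 10 new samples.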

Geometric transformations: We now evaluate the representation capabilities of
the proposed neural network encoder in the presence of geometric
transformations. A five layer encoder was trained on 600 images from the faces
dataset. As the test set, we used the remaining 200 faces, as well as a
collection of geometrically transformed images from the same test set.
Sub-pixel planar translations were used for geometric transformations. The
encoder was applied to the misaligned set, optimizing the $\ell_2 + \ell_1$
objective over the transformation parameters. For reference, the encoder was
also applied to the transformed and the untransformed test sets without
performing optimization. Examples of the obtained representations are
visualized in Figure 2. Note the relatively larger magnitude and the bigger
active set of the sparse outlier vector $o$ produced for the misaligned faces,
and how they are re-aligned when optimization over the transformation is
allowed. Since the original data are only approximately aligned, performing
optimal alignment during encoding frequently yields lower cost compared to the
plain encoding of the original data.

Video separation: Figure 3 shows background and foreground separation via
robust PCA on the surveillance video sequence "Hall of a business building"
taken from Li et al. (2004). The sequence consists of 88 x 72 images of an
indoor scene shot by a static camera in a mall. The scene has a nearly constant
background and walking people in the foreground. We used networks with five
layers and q = 5 trained to approximate the output of the exact RPCA on a
subset of the frames in the sequence. Parameters were set to λ* = 0.1, λ = 10~^.
The separation produced by the fast encoder is nearly identical to the output of
the exact algorithm and to the output of the code from Lin et al. (2009), used
as reference, while being considerably faster. Our Matlab implementation with
built-in GPU acceleration executed on an NVIDIA Tesla C2070 GPU propagates
a frame through a single layer of the network in merely 92 μsec. This is several
orders of magnitude faster than the commonly used iterative solver executed on
the CPU.
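
To make the per-frame separation concrete, the following is a minimal NumPy
sketch. It is neither the Matlab/GPU implementation nor the trained encoder: it
assumes a fixed, already learned dictionary U and alternately minimizes, for a
single vectorized frame x, the cost 0.5||x - Us - o||^2 + 0.5 λ*||s||^2 +
λ||o||_1. The per-frame form of the cost, the names robust_encode and
soft_threshold, the synthetic data, and the value of lam are assumptions made
for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def robust_encode(x, U, lam_star=0.1, lam=0.5, n_iter=50):
    """Alternating minimization over the code s and the sparse outlier o.

    Returns s (length q) and o (same length as x) such that U @ s plays the
    role of the background and o the role of the foreground.
    """
    q = U.shape[1]
    # Ridge-regression operator for the s-update, computed once.
    P = np.linalg.solve(U.T @ U + lam_star * np.eye(q), U.T)
    o = np.zeros_like(x)
    for _ in range(n_iter):
        s = P @ (x - o)                     # exact minimizer over s, o fixed
        o = soft_threshold(x - U @ s, lam)  # exact minimizer over o, s fixed
    return s, o

# Synthetic stand-in for one frame: low-rank background plus sparse foreground.
rng = np.random.default_rng(0)
m, q = 200, 5
U, _ = np.linalg.qr(rng.standard_normal((m, q)))  # hypothetical dictionary
background = U @ rng.standard_normal(q)
foreground = np.zeros(m)
foreground[rng.choice(m, size=10, replace=False)] = 5.0

s, o = robust_encode(background + foreground, U)
print("pixels flagged as foreground:", np.count_nonzero(o))
```

Each pass of the loop costs two small matrix-vector products and one
shrinkage, which is the kind of fixed-cost computation a layer of the encoder
performs. If per-layer times simply add (an assumption), five layers at
92 μsec each come to roughly 0.46 msec per frame.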
Table 2: Performance of audio separation methods on the MIR-1K dataset.

Method                            GNSDR    GSNR    GSAR    GSIR
Ideal freq. mask                  13.48    5.46   13.65   31.22
ADMoM RPCA Huang et al. (2012)     5.00    2.38    6.68   13.76
Proximal RPCA                      5.48    3.29    7.02   13.91
NN RPCA Untrained                  5.30    2.66    6.80   13.00
NN RPCA Unsupervised               5.62    2.87    6.90   14.02
NN RPCA Supervised                 6.38    3.18    7.22   16.47

Audio separation: We evaluate the separation performance of the proposed
methods on the MIR-1K dataset of Hsu & Jang (2010), containing 1000 16 kHz
clips extracted from 110 Chinese karaoke songs performed by 19 amateur singers
(11 males and 8 females). Clip durations range from 4 to 13 seconds, totaling
about 133 minutes. We reserved about 23 minutes of audio sung by one male and
one female singer (abjones and amy) for training; the remaining 110 minutes
from 17 singers were used for testing. The voice and the music tracks were
mixed linearly with equal energy. The experimental settings closely followed
those of Hsu & Jang (2010), to which the reader is referred for further
details. As the evaluation criteria, we used the BSS-EVAL metrics of Vincent
et al. (2006), which calculate the global normalized source-to-distortion
ratio (GNSDR), source-to-artifacts ratio (GSAR), source-to-interference ratio
(GSIR), and signal-to-noise ratio (GSNR). All networks used 20 layers with
q = 25. The following training regimes were compared: untrained parameters
initialized according to Algorithm 1 (Untrained); unsupervised learning with
the objective ^ (Unsupervised); and training supervised by the clean voice
and background tracks (Supervised). For reference, we also give the results of
ideal frequency masking as well as those of two exact RPCA algorithms: one
minimizing (2) using proximal splitting, and its noiseless version using the
augmented Lagrangian. Table 2 summarizes the obtained separation performance.
While unsupervised training makes fast RPCA encoders on par with the exact
RPCA (at a fraction of the computational complexity and latency of the latter),
significant improvement is achieved by using the supervised regime. We intend
to release a demo iOS application capable of performing the separation online
and in real-time on a hand-held device.
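
The mixing protocol and the headline metric can be sketched in a few lines.
The code below is a simplified illustration, not the BSS-EVAL toolbox: it
assumes that "equal energy" means rescaling the music track to the energy of
the voice track, uses the plain ratio 10 log10(||ref||^2 / ||ref - est||^2) in
place of the BSS-EVAL source-to-distortion ratio, and assumes the commonly used
definitions NSDR = SDR(estimate, reference) - SDR(mixture, reference) and
GNSDR as the clip-length-weighted mean of NSDR. All function names are
hypothetical.

```python
import numpy as np

def mix_equal_energy(voice, music):
    """Rescale music to the energy of voice; return (mixture, rescaled music)."""
    gain = np.sqrt(np.sum(voice ** 2) / np.sum(music ** 2))
    music_scaled = gain * music
    return voice + music_scaled, music_scaled

def sdr(estimate, reference):
    """Simplified distortion ratio in dB (undefined for a perfect estimate)."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

def nsdr(estimate, reference, mixture):
    """Improvement of the estimate over using the unprocessed mixture."""
    return sdr(estimate, reference) - sdr(mixture, reference)

def gnsdr(estimates, references, mixtures):
    """Mean NSDR over clips, weighted by clip length in samples."""
    weights = np.array([len(r) for r in references], dtype=float)
    scores = np.array([nsdr(e, r, x)
                       for e, r, x in zip(estimates, references, mixtures)])
    return float(np.sum(weights * scores) / np.sum(weights))

# Synthetic one-second clip at 16 kHz standing in for a voice/music pair.
rng = np.random.default_rng(0)
voice = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
music = rng.standard_normal(16000)
mixture, music_scaled = mix_equal_energy(voice, music)

# A mock "separated voice" that still contains 10% of the music amplitude.
estimate = voice + 0.1 * music_scaled
print("NSDR of the mock estimate (dB):", nsdr(estimate, voice, mixture))
```

With equal-energy mixing the unprocessed mixture scores 0 dB against the clean
voice under this simplified ratio, so the mock estimate above, which leaves one
tenth of the music amplitude, gains about 20 dB. The figures in Table 2 come
from the full BSS-EVAL decomposition and are not reproduced by this shortcut.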



5 Conclusion 

By combining ideas from structured non-convex optimization with multi-layer
neural networks, we have developed a comprehensive framework for the online
learning of robust low-rank representations that runs in real time and is
capable of handling large scale applications. The framework includes different
objective functions that allow the use of the encoders to solve challenging
alignment problems at almost the same computational cost. A basic
implementation already achieves speedups of several orders of magnitude when
compared to exact solvers, opening the door for practical algorithms following
the demonstrated success of robust PCA in various applications. Finally, robust
nonnegative matrix factorization can be obtained using very similar
architectures.